Athletes Data Analysis

About the Data

This data was retrieved from DBpedia with a SPARQL query on 17 August 2020. The query was:

    PREFIX foaf: <http://xmlns.com/foaf/0.1/>
    PREFIX dbo: <http://dbpedia.org/ontology/>
    PREFIX dbr: <http://dbpedia.org/resource/>
    PREFIX dbp: <http://dbpedia.org/property/>
    PREFIX ling: <http://purl.org/linguistics/gold/>
    SELECT DISTINCT ?a, ?dob, ?ht, ?hpn, ?g, ?name, ?c, ?intro
    WHERE {
      ?a a dbo:Athlete;
         dbo:birthDate ?dob;
         dbo:height ?ht;
         ling:hypernym ?hpn;
         foaf:gender ?g;
         foaf:name ?name;
         dbo:abstract ?intro.
      OPTIONAL { ?a dbo:country ?c }
      FILTER(LANG(?name) = "en").
    }

The format and structure of DBpedia's URIs keep evolving. As of 14 May 2021, the ling prefix (used for ling:hypernym) and foaf:gender are no longer available.

The updated query is:

    PREFIX foaf: <http://xmlns.com/foaf/0.1/>
    PREFIX dbo: <http://dbpedia.org/ontology/>
    PREFIX dbr: <http://dbpedia.org/resource/>
    PREFIX dbp: <http://dbpedia.org/property/>
    SELECT DISTINCT ?a, ?dob, ?ht, ?name, ?c, ?intro
    WHERE {
      ?a a dbo:Athlete;
         dbo:birthDate ?dob;
         dbo:height ?ht;
         foaf:name ?name;
         dbo:abstract ?intro.
      OPTIONAL { ?a dbo:country ?c }
      FILTER(LANG(?name) = "en").
    }
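For anyone who wants to re-run the updated query from Python, here is a minimal sketch using only the standard library. It is not the code used to build this dataset. It assumes the public endpoint at https://dbpedia.org/sparql and its `format` request parameter, trims the unused prefixes, writes the SELECT variables without commas (the standard SPARQL form), and adds a `LIMIT 100` so a test run stays small. The function names `build_url`, `bindings_to_rows` and `fetch_rows` are made up for this example.

```python
import json
import urllib.parse
import urllib.request

# Assumption: the public DBpedia endpoint; swap in a mirror if needed.
ENDPOINT = "https://dbpedia.org/sparql"

# Trimmed version of the updated query, with an illustrative LIMIT added.
QUERY = """
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
PREFIX dbo: <http://dbpedia.org/ontology/>
SELECT DISTINCT ?a ?dob ?ht ?name ?c ?intro
WHERE {
  ?a a dbo:Athlete ;
     dbo:birthDate ?dob ;
     dbo:height ?ht ;
     foaf:name ?name ;
     dbo:abstract ?intro .
  OPTIONAL { ?a dbo:country ?c }
  FILTER(LANG(?name) = "en")
}
LIMIT 100
"""


def build_url(query, endpoint=ENDPOINT):
    """Encode the query as a GET request asking for SPARQL JSON results."""
    params = urllib.parse.urlencode(
        {"query": query, "format": "application/sparql-results+json"}
    )
    return f"{endpoint}?{params}"


def bindings_to_rows(results):
    """Flatten SPARQL JSON results into a list of {variable: value} dicts."""
    rows = []
    for binding in results["results"]["bindings"]:
        # OPTIONAL variables (here ?c) are simply absent when unbound.
        rows.append({var: cell["value"] for var, cell in binding.items()})
    return rows


def fetch_rows(query=QUERY):
    """Run the query against the endpoint (needs network access)."""
    with urllib.request.urlopen(build_url(query)) as response:
        return bindings_to_rows(json.load(response))
```

Calling `fetch_rows()` returns a list of dicts that can be passed straight to `pandas.DataFrame`; athletes without a country simply lack the `c` key, which becomes a missing value in the frame.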

The following analysis is only meant to demonstrate how DBpedia's data can be leveraged.

Data Explorations

Since the data contains no language other than English, we can drop these columns.

Each resource link can be opened to see its properties. For example, visiting http://dbpedia.org/page/Gerard_van_Velde shows the properties of that DBpedia resource, so the scraped values can be verified against the page.

Obs:

Country has a lot of missing values

Date of Birth has an inconsistent date format, so the source (DBpedia) returns duplicate rows for the same athlete
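Both observations can be checked with a couple of pandas calls. The sketch below uses a tiny made-up frame, not the real data, with column names taken from the query variables (`a`, `dob`, `c`); the resource names and date strings are placeholders chosen only to show the pattern.

```python
import pandas as pd

# Hypothetical rows for illustration only; values are placeholders.
df = pd.DataFrame(
    {
        "a": ["resource/A", "resource/A", "resource/B"],
        "dob": ["1971-11-30", "1971-11-3", "1980-01-01"],
        "c": [None, None, "resource/Some_Country"],
    }
)

# Number of missing values in each column.
missing_per_column = df.isna().sum()

# Athletes that appear with more than one distinct date-of-birth string.
dob_variants = df.groupby("a")["dob"].nunique()
athletes_with_several_dobs = dob_variants[dob_variants > 1]

print(missing_per_column)
print(athletes_with_several_dobs)
```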

Obs

Got rid of all the duplicates here
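One simple way to do this, shown here as a sketch on made-up rows, is to keep a single row per athlete URI with `drop_duplicates`. Keeping the first occurrence is an arbitrary choice made for this example; the column name `a` comes from the query.

```python
import pandas as pd

# Hypothetical rows for illustration only; values are placeholders.
df = pd.DataFrame(
    {
        "a": ["resource/A", "resource/A", "resource/B"],
        "dob": ["1971-11-30", "1971-11-3", "1980-01-01"],
    }
)

# Keep one row per athlete URI (the first one encountered).
deduped = df.drop_duplicates(subset="a", keep="first").reset_index(drop=True)

print(len(df), "rows before,", len(deduped), "rows after")
```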

Wordcloud of Sports by Male and Female

Gender Count

Heights Distribution by Gender

info

I tried plotting this earlier, but the boxplot did not look right. It turned out that the height column had the object datatype, which is why the plot was off, so I converted it to float.
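A minimal sketch of that conversion, on a made-up series rather than the real column: `pd.to_numeric` with `errors="coerce"` turns anything that cannot be parsed into NaN instead of raising. Whether the real column contains unparseable entries is an assumption here.

```python
import pandas as pd

# Hypothetical height values stored as strings (object dtype).
heights = pd.Series(["1.84", "178.0", "not available"], dtype="object")

# Convert to float; unparseable entries become NaN.
heights_float = pd.to_numeric(heights, errors="coerce")

print(heights.dtype, "->", heights_float.dtype)
```

If the column is known to be clean, `heights.astype(float)` does the same job and fails loudly on bad values.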

Obs

The outliers are almost certainly wrong numbers. I am assuming they are on the wrong scale (presumably centimetres rather than metres), so I divided every value greater than 100 by 100. At the other end, a height cannot plausibly be as low as 0.5 m.
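The rescaling can be sketched as below, again on made-up values. The threshold of 100 and the divisor of 100 come from the note above; the helper name `rescale_heights` and the 0.5 m cut-off used to flag suspiciously small values are choices made for this example.

```python
import pandas as pd


def rescale_heights(heights, threshold=100, factor=100):
    """Divide values above `threshold` by `factor`; leave the rest untouched."""
    # Series.where keeps values where the condition holds and
    # substitutes the second argument everywhere else.
    return heights.where(heights <= threshold, heights / factor)


# Hypothetical heights: one plausible, one on the wrong scale, one too small.
heights = pd.Series([1.84, 184.0, 0.4])

fixed = rescale_heights(heights)

# Values that are still implausibly small after rescaling.
too_short = fixed[fixed < 0.5]

print(fixed.tolist())
print(too_short.index.tolist())
```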

Final Thoughts